In [1]:
from abydos.phonetic import *
from abydos.distance import *
import pandas as pd
Then we load some data into a DataFrame. In this case, we'll load the US Census surnames data, ranked by frequency.
In [2]:
names = pd.read_csv('../tests/corpora/uscensus2000.csv',
                    comment='#', index_col=1, usecols=(0, 1),
                    keep_default_na=False)
names.head()
Out[2]:
We can create a dictionary that maps each Soundex code to all the surnames sharing that code. Each such group is a set of Soundex collisions, which is also how Soundex is used for blocking. Getting the basic Soundex value of a string is as simple as calling soundex() on it.
In [3]:
soundex('WILLIAMSON')
Out[3]:
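To make the encoding less of a black box, here is a minimal pure-Python sketch of the standard American Soundex rules. It is not abydos's implementation, and the name simple_soundex is made up for this illustration; it assumes plain ASCII letters and the usual four-character code.

```python
def simple_soundex(word, length=4):
    """Simplified American Soundex (illustrative sketch, not abydos's code)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # but its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not break a run of equal codes
        digit = codes.get(ch, "")       # vowels and Y have no code
        if digit and digit != prev:
            result += digit
        prev = digit                    # a vowel resets prev, allowing a repeat
    return (result + "0" * length)[:length]  # pad with zeros, then truncate


print(simple_soundex("WILLIAMSON"))  # W452
```

The double L collapses to a single 4, the vowels are skipped, and the code is cut off after three digits, so the trailing N never contributes.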
Better yet, we can construct a Soundex() object to reuse for encoding multiple names.
In [4]:
sdx = Soundex()
reverse_soundex = {}
# Group every surname under its Soundex code
for name in names.name:
    encoded = sdx.encode(name)
    if encoded not in reverse_soundex:
        reverse_soundex[encoded] = set()
    reverse_soundex[encoded].add(name)
With this dictionary, we can retrieve all the names that map to the same Soundex value as, for example, the name Williamson.
In [5]:
reverse_soundex[soundex('WILLIAMSON')]
Out[5]:
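The same grouping pattern works for any encoder. As a small sketch, the helper below (block_by is a hypothetical name, not part of abydos) takes any callable that maps a name to a key and uses collections.defaultdict to avoid the explicit membership check. The toy encoder here just takes the first letter; in the notebook one would pass something like sdx.encode instead.

```python
from collections import defaultdict


def block_by(names, encode):
    """Group names into sets keyed by encode(name)."""
    blocks = defaultdict(set)
    for name in names:
        blocks[encode(name)].add(name)
    return dict(blocks)  # plain dict, so missing keys raise KeyError


# Toy data and a toy encoder (first letter only), purely for illustration.
toy_names = ["SMITH", "SMYTH", "JONES"]
blocks = block_by(toy_names, lambda name: name[0])
print(sorted(blocks["S"]))  # ['SMITH', 'SMYTH']
```

Converting back to a plain dict at the end is a deliberate choice: looking up a code that never occurred then fails loudly instead of silently creating an empty set.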
We can build up a DataFrame with some interesting information about these names. First, we'll just collect all the names in a column.
In [6]:
df = pd.DataFrame(sorted(reverse_soundex[soundex('WILLIAMSON')]), columns=['name'])
df
Out[6]:
To that, let's add a few distance measures.
In [7]:
# Levenshtein distance from 'WILLIAMSON'
lev = Levenshtein()
df['Levenshtein'] = df.name.apply(lambda name: lev.dist_abs('WILLIAMSON', name))
# Jaccard similarity on 2-grams
jac = Jaccard()
df['Jaccard'] = df.name.apply(lambda name: jac.sim('WILLIAMSON', name))
# Jaro-Winkler similarity
jw = JaroWinkler()
df['Jaro_Winkler'] = df.name.apply(lambda name: jw.sim('WILLIAMSON', name))
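For intuition about what the first two columns measure, here is a self-contained sketch of Levenshtein distance and of Jaccard similarity over character bigrams. These are textbook versions written for this illustration, not abydos's code. In particular, the Jaccard sketch uses plain sets of unpadded bigrams, and abydos's tokenizer may pad or count bigrams differently, so the values need not agree with the column above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]


def bigrams(s):
    """Set of adjacent character pairs, without start/stop padding."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_bigrams(a, b):
    """|A & B| / |A | B| over bigram sets; two empty sets count as identical."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


print(levenshtein("WILLIAMSON", "WILLIAMS"))      # 2
print(jaccard_bigrams("WILLIAMSON", "WILLIAMS"))  # 7/9, about 0.778
```

Dropping the final ON costs two deletions, and WILLIAMS shares 7 of the 9 distinct bigrams of WILLIAMSON.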
And finally, we'll add a few phonetic encodings.
In [8]:
# Double Metaphone (first code only)
dm = DoubleMetaphone()
df['Double_Metaphone'] = df.name.apply(lambda name: dm.encode(name)[0])
# NYSIIS
nysiis = NYSIIS()
df['NYSIIS'] = df.name.apply(lambda name: nysiis.encode(name))
# Alpha-SIS (first code only)
alphasis = AlphaSIS()
df['Alpha_SIS'] = df.name.apply(lambda name: alphasis.encode(name)[0])
In [9]:
df
Out[9]:
Let's check the row for WILLIAMSON.
In [10]:
df[df.name == 'WILLIAMSON']
Out[10]:
In addition to their Soundex collision, 7 names have matching first Double Metaphone encodings.
In [11]:
df[df.Double_Metaphone == 'ALMSN']
Out[11]:
28 have matching NYSIIS encodings.
In [12]:
df[df.NYSIIS == 'WALANS']
Out[12]:
And 7 have matching first Alpha-SIS encodings.
In [13]:
df[df.Alpha_SIS == '45302000000000']
Out[13]:
6 names match in all four of the phonetic algorithms considered here.
In [14]:
df[(df.Alpha_SIS == '45302000000000') & (df.NYSIIS == 'WALANS') &
   (df.Double_Metaphone == 'ALMSN')]
Out[14]:
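The chained conditions above amount to keeping the rows that agree with a reference row on every encoding column. A plain-Python sketch of that idea follows; agree_on_all is a hypothetical helper, and the row values are made-up placeholders rather than real encoder output.

```python
def agree_on_all(rows, reference, columns):
    """Return the rows whose values equal the reference row's in every column."""
    return [row for row in rows
            if all(row[col] == reference[col] for col in columns)]


# Made-up placeholder codes, not real encoder output.
rows = [
    {"name": "A", "enc1": "X1", "enc2": "Y1"},
    {"name": "B", "enc1": "X1", "enc2": "Y2"},
    {"name": "C", "enc1": "X1", "enc2": "Y1"},
]
matches = agree_on_all(rows, rows[0], ["enc1", "enc2"])
print([row["name"] for row in matches])  # ['A', 'C']
```

Taking the reference values from a row, rather than hard-coding each code string, keeps the filter correct if the reference name or the set of encoders changes.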